11. Intro to Machine Learning: Classification Technique

11.1 Different Categories of Machine Learning Techniques

[Figure: categories of machine learning techniques]

This is a more exhaustive review

What is Machine Learning?

Machine learning is the ability of a computer to learn from data automatically, without being explicitly reprogrammed each time. You feed in data with different attributes, or features; the algorithm has to understand them and give you a decision boundary based on the data you provide.

What is a Model?

A model is a mathematical representation of a physical reality. It is a generalized representation of causality, e.g. y = f(X), that holds generally and helps simulate outcomes in real business situations.

11.1.1 Types of Classification Techniques

11.2 Case Study to prove the point

Same Case Study from Week 10

We will use a case study approach in this class to understand the steps to take before building machine learning models, so that the data is robust enough for making predictions.

The problem statement comes from an online education platform: we’ll look at factors that help us select the most promising leads, i.e. the leads that are most likely to convert into paying customers.

Our ultimate goal: use the data from previous leads, some who converted to customers and many who did not, to build a model that we can use to score incoming leads for preferential retargeting.

The data dictionary for the data set is here

11.3 Load data and High level review

11.3.1 Review Target Variable

We have no idea what kind of variable it is...

11.4 Introducing Logistic Regression

Properties of Logistic Regression:

11.4.1 Sigmoid Function

[Figure: sigmoid function curve]
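As a minimal sketch of the sigmoid function, assuming the standard definition sigma(z) = 1 / (1 + e^(-z)), the snippet below maps a few made-up linear scores to probabilities. The function name `sigmoid` and the example scores are illustrative only and are not taken from the case-study data.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a real-valued score z into a value between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical linear scores (weights * features + intercept) for three leads
scores = [-4.0, 0.0, 4.0]
probabilities = [sigmoid(z) for z in scores]
```

A score of zero maps to a probability of 0.5, large negative scores approach 0, and large positive scores approach 1, which is what lets logistic regression turn a linear score into a class probability.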

11.5 Building a Logistic Regression Model

Several steps need to be followed to build a classification model

11.5.1 Variables to build model

11.5.1.1 Test for NULLs
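One possible way to test for NULLs with pandas is sketched below. The tiny data frame and its column names (`lead_source`, `time_on_site`, `converted`) are hypothetical stand-ins, not the actual case-study columns.

```python
import pandas as pd

# Tiny made-up frame; the column names are hypothetical, not from the case-study data
leads = pd.DataFrame({
    "lead_source": ["Google", None, "Referral", "Google"],
    "time_on_site": [120.0, 45.0, None, None],
    "converted": [1, 0, 0, 1],
})

# Number of missing values per column
null_counts = leads.isna().sum()

# Fraction of missing values per column (the mean of a boolean mask)
null_share = leads.isna().mean()
```

Looking at the share as well as the count helps when deciding whether a column is worth imputing or should be dropped.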

11.5.1.2 Split into categorical and numerical variables
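A minimal sketch of the split, assuming the data lives in a pandas data frame and that numeric dtypes mark the numerical variables. The columns shown here are made up for illustration.

```python
import pandas as pd

# Made-up frame with two text columns and two numeric columns (hypothetical names)
leads = pd.DataFrame({
    "lead_source": ["Google", "Referral", "Direct"],
    "occupation": ["Student", "Working", "Student"],
    "time_on_site": [120.0, 45.0, 300.0],
    "page_views": [3, 1, 7],
})

# Numeric dtypes (ints and floats) are treated as numerical variables
numerical_cols = leads.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical
categorical_cols = leads.select_dtypes(exclude="number").columns.tolist()
```

A dtype-based split is only a first pass: a numerically coded category (for example a 0/1 flag) lands in the numerical list and may need to be moved by hand.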

11.5.2 Categorical Variable cleaning

11.5.3 Numerical Variable cleaning

11.5.4 Categorical Variable treatment

11.5.4.1 Dummy Variables
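Dummy variables can be sketched with `pandas.get_dummies`, as below. The frame and the column name `lead_source` are hypothetical; `drop_first=True` is one common choice that drops a baseline category to avoid redundant columns, and is an assumption here rather than something the case study prescribes.

```python
import pandas as pd

# Made-up frame; "lead_source" is a hypothetical categorical column
leads = pd.DataFrame({
    "lead_source": ["Google", "Referral", "Google", "Direct"],
    "time_on_site": [120.0, 45.0, 300.0, 60.0],
})

# One indicator column per category, minus the first (alphabetical) category,
# which becomes the implicit baseline ("Direct" in this example)
encoded = pd.get_dummies(leads, columns=["lead_source"], drop_first=True)
```

A row whose indicator columns are all zero belongs to the dropped baseline category.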

11.6 Training the Model

2 Feature Model

11.6.1 Train Test Split

11.6.2 Fitting a Model
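The split and fit steps might look roughly like the sketch below with scikit-learn. The data is synthetic (generated with `make_classification`) and only stands in for a 2 feature model; the 30% test size, the stratified split, and the random seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the leads data: 200 rows, 2 numeric features, binary target
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Hold out 30% of the rows for testing; stratify keeps the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit on the training rows only
model = LogisticRegression()
model.fit(X_train, y_train)

# Predicted probability of class 1 for each test row, and accuracy on the test rows
test_probabilities = model.predict_proba(X_test)[:, 1]
test_accuracy = model.score(X_test, y_test)
```

Keeping the test rows out of `fit` is what makes the later evaluation an honest estimate of performance on unseen leads.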

Note: the color scheme is reversed.

11.7 Model Evaluation

Classification Models are evaluated with the help of Quality Metrics and not through visual inspection

Multivariate Model

The 2 feature model was built mostly for visualization.

Now, using the same formulation, we shall build a model with the 5 features that have the highest correlation.

11.7.1 Model Parameters

11.7.2 Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a classification model; it also lets you visualize how an algorithm performs. It is built by summing the numbers of correct and incorrect predictions class by class.
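The class-wise counting can be sketched in plain Python as follows. The helper name `confusion_counts` and the labels are made up for illustration, with 1 standing for a lead that converted and 0 for one that did not.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # correctly predicted positive
        elif actual == 0 and predicted == 1:
            fp += 1  # predicted positive, actually negative
        elif actual == 1 and predicted == 0:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # correctly predicted negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up labels: 1 = lead converted, 0 = lead did not convert
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
```

For these eight made-up leads the counts are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives.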

11.7.3 Precision & Recall

[Figure: quality metrics]
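As a small worked example, assuming the usual definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), the sketch below computes both from confusion-matrix counts. The counts are invented and the helper name `precision_recall` is hypothetical.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Return (precision, recall) from confusion-matrix counts."""
    # Precision: of the leads predicted to convert, how many actually did
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the leads that actually converted, how many were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented counts: 30 true positives, 10 false positives, 20 false negatives
precision, recall = precision_recall(tp=30, fp=10, fn=20)
```

With these counts precision is 30 / 40 = 0.75 and recall is 30 / 50 = 0.6. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.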

11.7.4 Receiver Operating Characteristic (ROC)

The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate. It shows the tradeoff between sensitivity and specificity.
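To make the construction of the curve concrete, here is a minimal plain-Python sketch that sweeps a decision threshold over predicted scores and records the false positive rate and true positive rate at each step. The function names `roc_points` and `auc` and the four example scores are illustrative assumptions; the sketch also assumes both classes are present in the labels.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives  # assumes both classes are present
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and predicted probabilities for four leads
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, y_score)
area = auc(points)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): more true positives are captured, at the cost of more false positives.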

11.8 Important Concepts at a High Level

11.8.1 Overfitting versus Underfitting

[Figure: overfitting versus underfitting]

11.8.2 Bias Variance Tradeoff

[Figure: bias variance tradeoff]

11.8.3 Cross-Validation

[Figure: cross-validation]
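The idea behind k-fold cross-validation can be sketched as an index split in plain Python: each fold takes one turn as the validation set while the remaining rows are used for training. The helper name `k_fold_indices` is hypothetical, and the sketch does not shuffle the rows, which a real workflow usually would.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Ten rows split into three folds
folds = list(k_fold_indices(n_samples=10, k=3))
```

Every row is used for validation exactly once, so averaging a metric over the folds gives a steadier estimate than a single train/test split.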

11.8.4 Leakage

[Figure: leakage]